import pandas as pd
df = pd.read_excel('BDMAESTRIA_ORIGINAL.xlsx')
df.head(1)
| | Numero fur | Fecha inicio | Codigo resultado | Secuencia | Marca | Linea | Modelo | Fecha matricula | Combustible | Cilindraje | ... | Gases ralenti hc | Gases ralenti co2 | Gases ralenti o2 | Gases crucero rpm | Gases crucero co | Gases crucero hc | Gases crucero co2 | Gases crucero o2 | Nivel de contaminacion | Clasificacion nivel de contaminacion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 387.0 | 2019-05-04 11:49:00 | 1 | 2.0 | VOLKSWAGEN | [VOLKSWAGEN] JETTA | 2009 | 2009-05-06 | GASOLINA | 2000 | ... | 2.0 | 15.4 | 0.0 | 2477 | 0.0 | 4.0 | 15.3 | 0.0 | 0.4 | 1 |
1 rows × 26 columns
import numpy as np
for col in df:
    if df[col].dtype == object:  # np.object was removed in NumPy 1.24+; use the builtin
        df[col] = df[col].fillna('None')  # fill blanks in object columns with the string 'None'
    else:
        df[col] = df[col].fillna(0)  # fill non-object columns with zeros
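The same dtype-based fill can be written without a Python loop. A minimal sketch on a hypothetical two-column frame (names chosen for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame standing in for df.
d = pd.DataFrame({'Marca': ['MAZDA', None], 'Cilindraje': [1600.0, np.nan]})

# Fill object columns with the string 'None', everything else with 0.
obj_cols = d.select_dtypes(include='object').columns
d[obj_cols] = d[obj_cols].fillna('None')
d = d.fillna(0)
```

select_dtypes picks out the object columns; the trailing fillna(0) only touches the remaining numeric NaNs.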
secuencia = df['Secuencia'] == 1.0
df=df[secuencia]
df= df.drop(['Secuencia'], axis=1)
df.reset_index(drop=True, inplace=True)
# Dummy variable for brands
for i in range(0, len(df)):
    if df.iloc[i, 3] not in ('RENAULT', 'MAZDA', 'CHEVROLET', 'TOYOTA', 'HYUNDAI'):
        df.iloc[i, 3] = 'OTRA'
print(df.shape)
df.head(1)
(9226, 25)
| | Numero fur | Fecha inicio | Codigo resultado | Marca | Linea | Modelo | Fecha matricula | Combustible | Cilindraje | Kilometraje | ... | Gases ralenti hc | Gases ralenti co2 | Gases ralenti o2 | Gases crucero rpm | Gases crucero co | Gases crucero hc | Gases crucero co2 | Gases crucero o2 | Nivel de contaminacion | Clasificacion nivel de contaminacion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 382.0 | 2019-05-04 08:19:00 | 2 | OTRA | [VOLKSWAGEN] JETTA | 2009 | 2009-05-06 | GASOLINA | 2000 | 47951 | ... | 2.0 | 15.4 | 0.0 | 2477 | 0.0 | 4.0 | 15.3 | 0.0 | 0.4 | 1 |
1 rows × 25 columns
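The row-by-row whitelist above can also be expressed with vectorized pandas indexing; a sketch on a stand-in column:

```python
import pandas as pd

# Hypothetical stand-in for df['Marca'].
d = pd.DataFrame({'Marca': ['RENAULT', 'VOLKSWAGEN', 'MAZDA', 'FIAT']})

conocidas = ['RENAULT', 'MAZDA', 'CHEVROLET', 'TOYOTA', 'HYUNDAI']
# Everything outside the whitelist collapses to 'OTRA', as in the loop above.
d.loc[~d['Marca'].isin(conocidas), 'Marca'] = 'OTRA'
```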
from dataprep.eda import create_report
report = create_report(df)
report
| Number of Variables | 25 |
|---|---|
| Number of Rows | 9226 |
| Missing Cells | 0 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 0 |
| Duplicate Rows (%) | 0.0% |
| Total Size in Memory | 3.7 MB |
| Average Row Size in Memory | 423.8 B |
| Variable Types | |
|---|---|
| Numerical | 19 |
| DateTime | 2 |
| Categorical | 4 |
Numero fur (numerical)
| Distinct Count | 8335 |
|---|---|
| Unique (%) | 90.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 5305.1853 |
| Minimum | 0 |
| Maximum | 11969 |
| Zeros | 85 |
| Zeros (%) | 0.9% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 188 |
| Q1 | 1701.75 |
| Median | 5182 |
| Q3 | 8588.5 |
| 95-th Percentile | 11273.75 |
| Maximum | 11969 |
| Range | 11969 |
| IQR | 6886.75 |
| Mean | 5305.1853 |
|---|---|
| Standard Deviation | 3716.859 |
| Variance | 1.3815e+07 |
| Sum | 4.8946e+07 |
| Skewness | 0.1383 |
| Kurtosis | -1.3051 |
| Coefficient of Variation | 0.7006 |
Fecha inicio (datetime)
| Distinct Count | 9214 |
|---|---|
| Unique (%) | 99.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 72.2 KB |
| Minimum | 2018-10-29 10:15:00 |
| Maximum | 2021-05-07 07:13:00 |
Codigo resultado (numerical)
| Distinct Count | 3 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 1.3014 |
| Minimum | 1 |
| Maximum | 3 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 1 |
|---|---|
| 5-th Percentile | 1 |
| Q1 | 1 |
| Median | 1 |
| Q3 | 2 |
| 95-th Percentile | 2 |
| Maximum | 3 |
| Range | 2 |
| IQR | 1 |
| Mean | 1.3014 |
|---|---|
| Standard Deviation | 0.4799 |
| Variance | 0.2303 |
| Sum | 12007 |
| Skewness | 1.1307 |
| Kurtosis | -0.07679 |
| Coefficient of Variation | 0.3688 |
Marca (categorical)
| Distinct Count | 6 |
|---|---|
| Unique (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 645.9 KB |
| Mean | 6.6907 |
|---|---|
| Standard Deviation | 1.9688 |
| Median | 7 |
| Minimum | 4 |
| Maximum | 9 |
| 1st row | OTRA |
|---|---|
| 2nd row | OTRA |
| 3rd row | RENAULT |
| 4th row | TOYOTA |
| 5th row | OTRA |
| Count | 61728 |
|---|---|
| Lowercase Letter | 0 |
| Space Separator | 0 |
| Uppercase Letter | 61728 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
Linea (categorical)
| Distinct Count | 981 |
|---|---|
| Unique (%) | 10.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 757.8 KB |
| Mean | 19.0966 |
|---|---|
| Standard Deviation | 4.4992 |
| Median | 18 |
| Minimum | 7 |
| Maximum | 42 |
| 1st row | [VOLKSWAGEN] JETTA... |
|---|---|
| 2nd row | [NISSAN] MARCH |
| 3rd row | [RENAULT] SANDERO |
| 4th row | [TOYOTA] COROLLA |
| 5th row | [AUDI] A3 |
| Count | 132649 |
|---|---|
| Lowercase Letter | 131 |
| Space Separator | 18290 |
| Uppercase Letter | 132518 |
| Dash Punctuation | 125 |
| Decimal Number | 6388 |
Modelo (numerical)
| Distinct Count | 71 |
|---|---|
| Unique (%) | 0.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 2004.5066 |
| Minimum | 1900 |
| Maximum | 2020 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 1900 |
|---|---|
| 5-th Percentile | 1987 |
| Q1 | 1998 |
| Median | 2007 |
| Q3 | 2012 |
| 95-th Percentile | 2015 |
| Maximum | 2020 |
| Range | 120 |
| IQR | 14 |
| Mean | 2004.5066 |
|---|---|
| Standard Deviation | 9.9386 |
| Variance | 98.7757 |
| Sum | 1.8494e+07 |
| Skewness | -2.3741 |
| Kurtosis | 14.2 |
| Coefficient of Variation | 0.004958 |
Fecha matricula (datetime)
| Distinct Count | 3474 |
|---|---|
| Unique (%) | 37.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 72.2 KB |
| Minimum | 1900-06-01 00:00:00 |
| Maximum | 2099-12-29 00:00:00 |
Combustible (categorical)
| Distinct Count | 4 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 658.8 KB |
| Mean | 8.1242 |
|---|---|
| Standard Deviation | 0.8971 |
| Median | 8 |
| Minimum | 8 |
| Maximum | 21 |
| 1st row | GASOLINA |
|---|---|
| 2nd row | GASOLINA |
| 3rd row | GASOLINA |
| 4th row | GASOLINA |
| 5th row | GASOLINA |
| Count | 74414 |
|---|---|
| Lowercase Letter | 0 |
| Space Separator | 364 |
| Uppercase Letter | 74414 |
| Dash Punctuation | 176 |
| Decimal Number | 0 |
Cilindraje (numerical)
| Distinct Count | 277 |
|---|---|
| Unique (%) | 3.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 1742.9323 |
| Minimum | 0 |
| Maximum | 40000 |
| Zeros | 10 |
| Zeros (%) | 0.1% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 1000 |
| Q1 | 1300 |
| Median | 1597 |
| Q3 | 1975 |
| 95-th Percentile | 3000 |
| Maximum | 40000 |
| Range | 40000 |
| IQR | 675 |
| Mean | 1742.9323 |
|---|---|
| Standard Deviation | 1152.9343 |
| Variance | 1.3293e+06 |
| Sum | 1.608e+07 |
| Skewness | 13.3424 |
| Kurtosis | 299.1645 |
| Coefficient of Variation | 0.6615 |
Kilometraje (numerical)
| Distinct Count | 7094 |
|---|---|
| Unique (%) | 76.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 160118.468 |
| Minimum | 0 |
| Maximum | 5.462e+06 |
| Zeros | 469 |
| Zeros (%) | 5.1% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 78912.75 |
| Median | 125800 |
| Q3 | 193438.75 |
| 95-th Percentile | 391470.25 |
| Maximum | 5.462e+06 |
| Range | 5.462e+06 |
| IQR | 114526 |
| Mean | 160118.468 |
|---|---|
| Standard Deviation | 186583.2677 |
| Variance | 3.4813e+10 |
| Sum | 1.4773e+09 |
| Skewness | 10.9834 |
| Kurtosis | 242.3674 |
| Coefficient of Variation | 1.1653 |
Gases temperatura ambiente (numerical)
| Distinct Count | 208 |
|---|---|
| Unique (%) | 2.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 21.3565 |
| Minimum | 5.4 |
| Maximum | 41.5 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 5.4 |
|---|---|
| 5-th Percentile | 17.3 |
| Q1 | 20.2 |
| Median | 20.4 |
| Q3 | 23.1 |
| 95-th Percentile | 25.5 |
| Maximum | 41.5 |
| Range | 36.1 |
| IQR | 2.9 |
| Mean | 21.3565 |
|---|---|
| Standard Deviation | 2.6074 |
| Variance | 6.7987 |
| Sum | 197034.7 |
| Skewness | 0.8216 |
| Kurtosis | 3.6441 |
| Coefficient of Variation | 0.1221 |
Gases humedad relativa (numerical)
| Distinct Count | 591 |
|---|---|
| Unique (%) | 6.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 56.5416 |
| Minimum | 30 |
| Maximum | 89.9 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 30 |
|---|---|
| 5-th Percentile | 36.7 |
| Q1 | 46.8 |
| Median | 54.9 |
| Q3 | 65.9 |
| 95-th Percentile | 80.275 |
| Maximum | 89.9 |
| Range | 59.9 |
| IQR | 19.1 |
| Mean | 56.5416 |
|---|---|
| Standard Deviation | 13.1283 |
| Variance | 172.3513 |
| Sum | 521653.2 |
| Skewness | 0.3171 |
| Kurtosis | -0.6573 |
| Coefficient of Variation | 0.2322 |
Gases convertidor catalitico (categorical)
| Distinct Count | 3 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 530.4 KB |
| Mean | 4 |
|---|---|
| Standard Deviation | 0 |
| Median | 4 |
| Minimum | 4 |
| Maximum | 4 |
| 1st row | None |
|---|---|
| 2nd row | None |
| 3rd row | None |
| 4th row | None |
| 5th row | None |
| Count | 22584 |
|---|---|
| Lowercase Letter | 16938 |
| Space Separator | 0 |
| Uppercase Letter | 5646 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
Gases ralenti rpm (numerical)
| Distinct Count | 636 |
|---|---|
| Unique (%) | 6.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 795.6508 |
| Minimum | 401 |
| Maximum | 1095 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 401 |
|---|---|
| 5-th Percentile | 604 |
| Q1 | 727 |
| Median | 779 |
| Q3 | 857.75 |
| 95-th Percentile | 1030 |
| Maximum | 1095 |
| Range | 694 |
| IQR | 130.75 |
| Mean | 795.6508 |
|---|---|
| Standard Deviation | 123.3386 |
| Variance | 15212.4205 |
| Sum | 7.3407e+06 |
| Skewness | 0.259 |
| Kurtosis | -0.01066 |
| Coefficient of Variation | 0.155 |
Gases ralenti co (numerical)
| Distinct Count | 663 |
|---|---|
| Unique (%) | 7.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 0.6278 |
| Minimum | 0 |
| Maximum | 13.1 |
| Zeros | 2509 |
| Zeros (%) | 27.2% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 0 |
| Median | 0.09 |
| Q3 | 0.48 |
| 95-th Percentile | 3.62 |
| Maximum | 13.1 |
| Range | 13.1 |
| IQR | 0.48 |
| Mean | 0.6278 |
|---|---|
| Standard Deviation | 1.5293 |
| Variance | 2.3387 |
| Sum | 5791.69 |
| Skewness | 4.0876 |
| Kurtosis | 18.9346 |
| Coefficient of Variation | 2.4361 |
Gases ralenti hc (numerical)
| Distinct Count | 898 |
|---|---|
| Unique (%) | 9.7% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 166.5987 |
| Minimum | 0 |
| Maximum | 15876 |
| Zeros | 18 |
| Zeros (%) | 0.2% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 9 |
| Q1 | 19 |
| Median | 57 |
| Q3 | 177.75 |
| 95-th Percentile | 543.75 |
| Maximum | 15876 |
| Range | 15876 |
| IQR | 158.75 |
| Mean | 166.5987 |
|---|---|
| Standard Deviation | 456.297 |
| Variance | 208206.9384 |
| Sum | 1.537e+06 |
| Skewness | 16.0938 |
| Kurtosis | 414.6123 |
| Coefficient of Variation | 2.7389 |
Gases ralenti co2 (numerical)
| Distinct Count | 222 |
|---|---|
| Unique (%) | 2.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 13.2371 |
| Minimum | 2.3 |
| Maximum | 15.5 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 2.3 |
|---|---|
| 5-th Percentile | 10.2 |
| Q1 | 12.7 |
| Median | 13.7 |
| Q3 | 14.3 |
| 95-th Percentile | 14.7 |
| Maximum | 15.5 |
| Range | 13.2 |
| IQR | 1.6 |
| Mean | 13.2371 |
|---|---|
| Standard Deviation | 1.581 |
| Variance | 2.4997 |
| Sum | 122125.12 |
| Skewness | -2.1664 |
| Kurtosis | 6.823 |
| Coefficient of Variation | 0.1194 |
Gases ralenti o2 (numerical)
| Distinct Count | 585 |
|---|---|
| Unique (%) | 6.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 1.3049 |
| Minimum | 0 |
| Maximum | 17.2 |
| Zeros | 892 |
| Zeros (%) | 9.7% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 0.2 |
| Median | 0.8 |
| Q3 | 1.7875 |
| 95-th Percentile | 4.3 |
| Maximum | 17.2 |
| Range | 17.2 |
| IQR | 1.5875 |
| Mean | 1.3049 |
|---|---|
| Standard Deviation | 1.6191 |
| Variance | 2.6216 |
| Sum | 12038.64 |
| Skewness | 2.9949 |
| Kurtosis | 15.1125 |
| Coefficient of Variation | 1.2408 |
Gases crucero rpm (numerical)
| Distinct Count | 478 |
|---|---|
| Unique (%) | 5.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 2495.9756 |
| Minimum | 2253 |
| Maximum | 2746 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 2253 |
|---|---|
| 5-th Percentile | 2322 |
| Q1 | 2412 |
| Median | 2496 |
| Q3 | 2579 |
| 95-th Percentile | 2673.75 |
| Maximum | 2746 |
| Range | 493 |
| IQR | 167 |
| Mean | 2495.9756 |
|---|---|
| Standard Deviation | 107.6275 |
| Variance | 11583.6698 |
| Sum | 2.3028e+07 |
| Skewness | 0.04019 |
| Kurtosis | -0.8457 |
| Coefficient of Variation | 0.04312 |
Gases crucero co (numerical)
| Distinct Count | 715 |
|---|---|
| Unique (%) | 7.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 0.7527 |
| Minimum | 0 |
| Maximum | 13.84 |
| Zeros | 2476 |
| Zeros (%) | 26.8% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 0 |
| Median | 0.12 |
| Q3 | 0.6 |
| 95-th Percentile | 4.1475 |
| Maximum | 13.84 |
| Range | 13.84 |
| IQR | 0.6 |
| Mean | 0.7527 |
|---|---|
| Standard Deviation | 1.6628 |
| Variance | 2.7651 |
| Sum | 6944.62 |
| Skewness | 3.7122 |
| Kurtosis | 15.9264 |
| Coefficient of Variation | 2.2091 |
Gases crucero hc (numerical)
| Distinct Count | 712 |
|---|---|
| Unique (%) | 7.7% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 120.7846 |
| Minimum | 0 |
| Maximum | 11436 |
| Zeros | 8 |
| Zeros (%) | 0.1% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 10 |
| Q1 | 20 |
| Median | 41 |
| Q3 | 119 |
| 95-th Percentile | 357 |
| Maximum | 11436 |
| Range | 11436 |
| IQR | 99 |
| Mean | 120.7846 |
|---|---|
| Standard Deviation | 362.2241 |
| Variance | 131206.2642 |
| Sum | 1.1144e+06 |
| Skewness | 14.003 |
| Kurtosis | 288.0851 |
| Coefficient of Variation | 2.9989 |
Gases crucero co2 (numerical)
| Distinct Count | 192 |
|---|---|
| Unique (%) | 2.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 13.4802 |
| Minimum | 2.85 |
| Maximum | 15.5 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 2.85 |
|---|---|
| 5-th Percentile | 10.6 |
| Q1 | 13.1 |
| Median | 14 |
| Q3 | 14.3 |
| 95-th Percentile | 14.8 |
| Maximum | 15.5 |
| Range | 12.65 |
| IQR | 1.2 |
| Mean | 13.4802 |
|---|---|
| Standard Deviation | 1.4528 |
| Variance | 2.1105 |
| Sum | 124368.46 |
| Skewness | -2.3464 |
| Kurtosis | 7.414 |
| Coefficient of Variation | 0.1078 |
Gases crucero o2 (numerical)
| Distinct Count | 497 |
|---|---|
| Unique (%) | 5.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 0.8141 |
| Minimum | 0 |
| Maximum | 16.9 |
| Zeros | 1218 |
| Zeros (%) | 13.2% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0 |
|---|---|
| 5-th Percentile | 0 |
| Q1 | 0.13 |
| Median | 0.4 |
| Q3 | 1 |
| 95-th Percentile | 3.2 |
| Maximum | 16.9 |
| Range | 16.9 |
| IQR | 0.87 |
| Mean | 0.8141 |
|---|---|
| Standard Deviation | 1.2243 |
| Variance | 1.4988 |
| Sum | 7510.72 |
| Skewness | 4.0448 |
| Kurtosis | 27.0551 |
| Coefficient of Variation | 1.5039 |
Nivel de contaminacion (numerical)
| Distinct Count | 279 |
|---|---|
| Unique (%) | 3.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 3.1508 |
| Minimum | 0.4 |
| Maximum | 157.2 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 0.4 |
|---|---|
| 5-th Percentile | 0.7 |
| Q1 | 0.9 |
| Median | 1.6 |
| Q3 | 3.5 |
| 95-th Percentile | 9.7 |
| Maximum | 157.2 |
| Range | 156.8 |
| IQR | 2.6 |
| Mean | 3.1508 |
|---|---|
| Standard Deviation | 5.4506 |
| Variance | 29.7093 |
| Sum | 29069 |
| Skewness | 9.9272 |
| Kurtosis | 172.5996 |
| Coefficient of Variation | 1.7299 |
Clasificacion nivel de contaminacion (numerical)
| Distinct Count | 5 |
|---|---|
| Unique (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 144.2 KB |
| Mean | 2.1728 |
| Minimum | 1 |
| Maximum | 5 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
| Minimum | 1 |
|---|---|
| 5-th Percentile | 1 |
| Q1 | 2 |
| Median | 2 |
| Q3 | 2 |
| 95-th Percentile | 4 |
| Maximum | 5 |
| Range | 4 |
| IQR | 0 |
| Mean | 2.1728 |
|---|---|
| Standard Deviation | 0.8917 |
| Variance | 0.7951 |
| Sum | 20046 |
| Skewness | 1.7277 |
| Kurtosis | 3.1993 |
| Coefficient of Variation | 0.4104 |
from dataprep.eda import plot
plot(df, "Codigo resultado","Nivel de contaminacion")
from dataprep.eda import plot
plot(df, "Marca","Nivel de contaminacion")
plot(df, "Linea","Nivel de contaminacion")
plot(df, "Modelo","Nivel de contaminacion")
plot(df, "Combustible","Nivel de contaminacion")
plot(df, "Cilindraje","Nivel de contaminacion")
plot(df, "Kilometraje","Nivel de contaminacion")
plot(df, "Gases temperatura ambiente","Nivel de contaminacion")
plot(df, "Gases humedad relativa","Nivel de contaminacion")
for i in range(0, len(df)):
    if df.iloc[i, 3] == 'RENAULT':
        df.iloc[i, 3] = 1
    elif df.iloc[i, 3] == 'MAZDA':
        df.iloc[i, 3] = 2
    elif df.iloc[i, 3] == 'CHEVROLET':
        df.iloc[i, 3] = 3
    elif df.iloc[i, 3] == 'TOYOTA':
        df.iloc[i, 3] = 4
    elif df.iloc[i, 3] == 'HYUNDAI':
        df.iloc[i, 3] = 5
    elif df.iloc[i, 3] == 'OTRA':
        df.iloc[i, 3] = 6
df['Marca']=pd.to_numeric(df['Marca'])
for i in range(0, len(df)):
    if df.iloc[i, 7] == 'GAS - GASOLINA':
        df.iloc[i, 7] = 1
    elif df.iloc[i, 7] == 'GAS NATURAL VEHICULAR':
        df.iloc[i, 7] = 2
    elif df.iloc[i, 7] == 'GASOLINA':
        df.iloc[i, 7] = 3
    elif df.iloc[i, 7] == 'GASOLINA - ELECTRICO':
        df.iloc[i, 7] = 4
df['Combustible']=pd.to_numeric(df['Combustible'])
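Both label-to-code loops can be condensed with Series.map; a sketch on a hypothetical mini-frame using the same category codes as above:

```python
import pandas as pd

# Hypothetical mini-frame with the same categories used above.
d = pd.DataFrame({'Marca': ['RENAULT', 'OTRA', 'TOYOTA'],
                  'Combustible': ['GASOLINA', 'GAS - GASOLINA', 'GASOLINA']})

marcas = {'RENAULT': 1, 'MAZDA': 2, 'CHEVROLET': 3,
          'TOYOTA': 4, 'HYUNDAI': 5, 'OTRA': 6}
combustibles = {'GAS - GASOLINA': 1, 'GAS NATURAL VEHICULAR': 2,
                'GASOLINA': 3, 'GASOLINA - ELECTRICO': 4}

# map() replaces each label by its code and yields a numeric dtype directly,
# so the pd.to_numeric step becomes unnecessary.
d['Marca'] = d['Marca'].map(marcas)
d['Combustible'] = d['Combustible'].map(combustibles)
```

Note that labels missing from the dict would become NaN, which makes unexpected categories easy to spot.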
print(list(df.columns))
['Numero fur', 'Fecha inicio', 'Codigo resultado', 'Marca', 'Linea', 'Modelo', 'Fecha matricula', 'Combustible', 'Cilindraje', 'Kilometraje', 'Gases temperatura ambiente', 'Gases humedad relativa', 'Gases convertidor catalitico', 'Gases ralenti rpm', 'Gases ralenti co', 'Gases ralenti hc', 'Gases ralenti co2', 'Gases ralenti o2', 'Gases crucero rpm', 'Gases crucero co', 'Gases crucero hc', 'Gases crucero co2', 'Gases crucero o2', 'Nivel de contaminacion', 'Clasificacion nivel de contaminacion']
# define the variables:
X = df[['Codigo resultado', 'Marca','Modelo','Combustible','Gases ralenti co', 'Gases ralenti hc', 'Gases ralenti co2', 'Gases ralenti o2', 'Gases crucero co', 'Gases crucero hc',
'Gases crucero co2', 'Gases crucero o2', 'Nivel de contaminacion', 'Clasificacion nivel de contaminacion']]
X.dtypes
Codigo resultado                          int64
Marca                                     int64
Modelo                                    int64
Combustible                               int64
Gases ralenti co                        float64
Gases ralenti hc                        float64
Gases ralenti co2                       float64
Gases ralenti o2                        float64
Gases crucero co                        float64
Gases crucero hc                        float64
Gases crucero co2                       float64
Gases crucero o2                        float64
Nivel de contaminacion                  float64
Clasificacion nivel de contaminacion      int64
dtype: object
Center the data:
X_centralizada = X - X.mean()
X_correlacion = np.corrcoef(X_centralizada, rowvar=False)  # correlation matrix, method 1
from dataprep.eda import plot_correlation
plot_correlation(X_centralizada)
| | Pearson | Spearman | KendallTau |
|---|---|---|---|
| Highest Positive Correlation | 0.912 | 0.926 | 0.789 |
| Highest Negative Correlation | -0.747 | -0.832 | -0.655 |
| Lowest Correlation | 0.001 | 0.007 | 0.006 |
| Mean Correlation | 0.037 | 0.066 | 0.057 |
matriz_cov = np.cov(X_centralizada, rowvar=False)
determinante = np.linalg.det(matriz_cov)
determinante
280042.1585151153
ncondicion = np.linalg.cond(matriz_cov)
ncondicion
439713670.9496324
from sklearn.covariance import LedoitWolf
cov_LW = LedoitWolf().fit(X_centralizada)
cov_LW_encogida = cov_LW.covariance_
cov_LW_shrinkage = cov_LW.shrinkage_
cov_LW_shrinkage
0.04384637018486556
determinante_LW = np.linalg.det(cov_LW_encogida)
determinante_LW
3.329682881740641e+46
ncondicion_LW = np.linalg.cond(cov_LW_encogida)
ncondicion_LW
257.15176313170514
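The drop in the condition number (from roughly 4.4e8 to 257) is the point of the shrinkage. A self-contained sketch on synthetic data showing the same effect (the sizes are arbitrary):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
# Toy data: few samples relative to the dimension, so the empirical
# covariance matrix is poorly conditioned.
Xtoy = rng.normal(size=(30, 20))

emp = np.cov(Xtoy, rowvar=False)
lw = LedoitWolf().fit(Xtoy)

cond_emp = np.linalg.cond(emp)
cond_lw = np.linalg.cond(lw.covariance_)
# Shrinkage pulls the eigenvalues toward their mean, which lowers
# the condition number of the estimate.
```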
Identification of outliers via Mahalanobis distances:
X_centralizada = X_centralizada.values
Número_filas, Número_columnas = X_centralizada.shape
Transpuesta = X_centralizada.transpose()
Inverso_covarianza = np.linalg.inv(cov_LW_encogida)
MD = np.zeros((Número_filas, 1))
for i in range(1, Número_filas):  # starts at 1, so MD[0] stays at 0 and row 0 is never scored
    tem1 = np.dot(X_centralizada[i, :], Inverso_covarianza)
    tem2 = np.dot(tem1, Transpuesta[:, i])
    MD[i] = np.sqrt(tem2)
print('The Mahalanobis distances are: \n', MD)
The Mahalanobis distances are: 
 [[ 0.        ]
 [ 0.56631968]
 [ 0.46561107]
 ...
 [39.20725525]
 [27.71327871]
 [35.12280348]]
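For reference, the same distances can be obtained in one vectorized expression instead of the Python loop; a sketch on synthetic centered data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.normal(size=(100, 3))            # stand-in for X_centralizada
inv_cov = np.linalg.inv(np.cov(Xc, rowvar=False))

# d_i = sqrt(x_i' S^{-1} x_i), computed for all rows at once via einsum.
MD_vec = np.sqrt(np.einsum('ij,jk,ik->i', Xc, inv_cov, Xc))

# The explicit per-row loop gives the same numbers.
MD_loop = np.array([np.sqrt(x @ inv_cov @ x) for x in Xc])
```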
MD = pd.DataFrame(MD)
MD.columns = ['Distancias_Mahal']
corte = np.percentile(MD, 99.99)
corte
38.46324038107876
X_99 = pd.concat([X, MD], axis=1)
X_99_corte = X_99['Distancias_Mahal'] < corte
X_99 = X_99[X_99_corte]
X_99.reset_index(drop=True, inplace=True)
X_99 = X_99.drop(['Distancias_Mahal'],axis=1)
X_99.head(2)
| | Codigo resultado | Marca | Modelo | Combustible | Gases ralenti co | Gases ralenti hc | Gases ralenti co2 | Gases ralenti o2 | Gases crucero co | Gases crucero hc | Gases crucero co2 | Gases crucero o2 | Nivel de contaminacion | Clasificacion nivel de contaminacion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 6 | 2009 | 3 | 0.0 | 2.0 | 15.4 | 0.0 | 0.0 | 4.0 | 15.3 | 0.0 | 0.4 | 1 |
| 1 | 2 | 6 | 2019 | 3 | 0.0 | 0.0 | 15.1 | 0.0 | 0.0 | 0.0 | 14.8 | 0.0 | 0.4 | 1 |
plot(X_99)
| Number of Variables | 14 |
|---|---|
| Number of Rows | 9225 |
| Missing Cells | 0 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 4 |
| Duplicate Rows (%) | 0.0% |
| Total Size in Memory | 1009.1 KB |
| Average Row Size in Memory | 112.0 B |
| Codigo resultado is skewed | Skewed |
|---|---|
| Marca is skewed | Skewed |
| Modelo is skewed | Skewed |
| Combustible is skewed | Skewed |
| Gases ralenti co is skewed | Skewed |
| Gases ralenti hc is skewed | Skewed |
| Gases ralenti co2 is skewed | Skewed |
| Gases ralenti o2 is skewed | Skewed |
| Gases crucero co is skewed | Skewed |
| Gases crucero hc is skewed | Skewed |
| Gases crucero co2 is skewed | Skewed |
| Gases crucero o2 is skewed | Skewed |
| Nivel de contaminacion is skewed | Skewed |
| Clasificacion nivel de contaminacion is skewed | Skewed |
| Gases ralenti co has 2509 (27.2%) zeros | Zeros |
| Gases ralenti o2 has 892 (9.67%) zeros | Zeros |
| Gases crucero co has 2476 (26.84%) zeros | Zeros |
| Gases crucero o2 has 1218 (13.2%) zeros | Zeros |
cov_LW99 = LedoitWolf().fit(X_99)
cov_LW99_encogida = cov_LW99.covariance_  # covariance from the estimator fitted on X_99
autovalor,autovector = np.linalg.eig(cov_LW99_encogida)
autovalor.round(3)
array([273441.706, 53221.611, 1149.258, 1068.475, 1066.801,
1066.473, 1063.986, 1064.053, 1063.605, 1063.57 ,
1063.495, 1063.397, 1063.365, 1063.348])
autovector.round(3)
array([[ 0. , 0. , 0.008, -0.072, -0.007, 0. , -0.029, -0.002,
-0.341, 0.225, 0.909, -0.002, 0.009, -0. ],
[-0. , -0. , 0.004, 0.076, 0.363, 0.929, -0.007, -0.01 ,
-0.013, 0.004, 0.002, 0.007, -0. , 0. ],
[-0.006, 0.001, -0.988, -0.148, -0.041, 0.032, -0.007, -0.008,
-0.002, -0.001, -0.004, -0.001, -0.001, -0. ],
[-0. , -0. , -0.003, 0.004, 0.029, -0.019, 0.001, 0.043,
-0.175, -0.222, -0.008, 0.957, -0.03 , 0.003],
[ 0.001, -0. , 0.053, -0.466, 0.278, -0.067, 0.614, -0.189,
-0.066, -0.261, 0.021, -0.061, 0.334, -0.312],
[ 0.817, -0.577, -0.006, 0.004, 0. , -0. , -0.001, -0. ,
0. , 0.001, -0. , 0. , -0. , -0.006],
[-0.002, 0. , -0.07 , 0.38 , 0.31 , -0.158, -0.382, -0.134,
-0.341, -0.375, -0.02 , -0.141, 0.536, 0.047],
[ 0.001, -0. , 0.048, -0.108, -0.645, 0.266, -0.002, 0.502,
-0.134, -0.274, 0.001, -0.074, 0.37 , -0.104],
[ 0.001, 0. , 0.065, -0.514, 0.278, -0.066, -0.549, 0.283,
-0.015, -0.255, 0.004, -0.09 , -0.312, -0.313],
[ 0.577, 0.817, -0.003, 0.004, 0. , -0. , 0.001, 0.001,
0.001, 0.001, 0. , 0. , -0. , -0.006],
[-0.001, -0.001, -0.068, 0.359, 0.155, -0.084, 0.408, 0.396,
-0.323, -0.356, 0.016, -0.184, -0.495, 0.044],
[ 0.001, 0.001, 0.036, -0.035, -0.415, 0.156, -0.088, -0.674,
-0.249, -0.367, -0.01 , -0.095, -0.348, -0.107],
[ 0.01 , 0.002, 0.059, -0.402, 0.046, 0.016, 0.011, -0.002,
-0.04 , -0.22 , 0.008, -0.06 , 0.006, 0.882],
[ 0.001, 0. , 0.032, -0.196, 0.004, 0.003, -0.003, 0.028,
-0.739, 0.491, -0.414, -0.025, -0.001, -0. ]])
# Proportion of each eigenvalue:
proporcion=autovalor/sum(autovalor)
proporcion
array([0.80536985, 0.156754 , 0.00338492, 0.00314699, 0.00314206,
0.00314109, 0.00313377, 0.00313396, 0.00313264, 0.00313254,
0.00313232, 0.00313203, 0.00313194, 0.00313189])
0.80536985 + 0.156754  # ≈ 96% of the data's variability captured in 2 dimensions
0.96212385
# Extract column 0: the eigenvector paired with the largest eigenvalue, i.e. the projection direction of maximum "shadow":
vector_proyeccion = autovector[:, 0]
# Method 2: PCA
from sklearn.decomposition import PCA
# set the number of components
n_componentes = 13  # chosen by intuition
pca = PCA(n_components=n_componentes)
pca
PCA(n_components=13)
X_PCA = pca.fit_transform(cov_LW_encogida)  # note: fitted on the shrunk covariance matrix, not on the raw observations
X_PCA = pd.DataFrame(X_PCA)
pca.explained_variance_ratio_
array([9.57828713e-01, 4.19829520e-02, 1.95114605e-05, 1.69907148e-05,
1.69509941e-05, 1.69299280e-05, 1.68644161e-05, 1.68624063e-05,
1.68495594e-05, 1.68469367e-05, 1.68437831e-05, 1.68427289e-05,
1.68421797e-05])
sum(pca.explained_variance_ratio_)
1.0000000000000002
sum(pca.explained_variance_ratio_[:1])
0.9578287128465874
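A common alternative to picking n_components by intuition is to take the smallest number of components that reaches a target share of the variance; a sketch on synthetic data (the 95% threshold is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic data where two directions dominate the variance.
Xs = rng.normal(size=(200, 6)) * np.array([10.0, 5.0, 1.0, 1.0, 1.0, 1.0])

pca = PCA().fit(Xs)
acumulada = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches 95%.
n_comp = int(np.argmax(acumulada >= 0.95) + 1)
```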
X_99.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9225 entries, 0 to 9224
Data columns (total 14 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Codigo resultado                      9225 non-null   int64
 1   Marca                                 9225 non-null   int64
 2   Modelo                                9225 non-null   int64
 3   Combustible                           9225 non-null   int64
 4   Gases ralenti co                      9225 non-null   float64
 5   Gases ralenti hc                      9225 non-null   float64
 6   Gases ralenti co2                     9225 non-null   float64
 7   Gases ralenti o2                      9225 non-null   float64
 8   Gases crucero co                      9225 non-null   float64
 9   Gases crucero hc                      9225 non-null   float64
 10  Gases crucero co2                     9225 non-null   float64
 11  Gases crucero o2                      9225 non-null   float64
 12  Nivel de contaminacion                9225 non-null   float64
 13  Clasificacion nivel de contaminacion  9225 non-null   int64
dtypes: float64(9), int64(5)
memory usage: 1009.1 KB
X = X_99[[ 'Marca','Modelo','Combustible','Gases ralenti co', 'Gases ralenti hc', 'Gases ralenti co2', 'Gases ralenti o2', 'Gases crucero co', 'Gases crucero hc',
'Gases crucero co2', 'Gases crucero o2']]
y = X_99[['Clasificacion nivel de contaminacion']]
print(X.shape, y.shape)
(9225, 11) (9225, 1)
from sklearn.model_selection import train_test_split
# split the data into train and eval sets
x_train, x_eval, y_train, y_eval = train_test_split(X, y, test_size=0.20, train_size=0.80, random_state=0)
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
import seaborn as sb
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
scaler = MinMaxScaler()  # to scale features to [0, 1]
x_train = scaler.fit_transform(x_train)
x_eval = scaler.transform(x_eval)
import warnings
warnings.filterwarnings('ignore')
from matplotlib import style
k_range = range(1, 30)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    scores.append(knn.score(x_eval, y_eval))
fig, ax = plt.subplots(figsize=(6, 3.84))
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0, 5, 10, 15, 30]);
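Rather than reading the best k off the scatter plot, it can be taken directly from the scores list; a sketch with hypothetical accuracy values standing in for the real ones:

```python
import numpy as np

# Hypothetical eval accuracies, standing in for the `scores` list above.
scores_demo = [0.78, 0.81, 0.83, 0.82]
k_range_demo = range(1, 5)

# argmax gives the position of the best score; index back into the k values.
mejor_k = list(k_range_demo)[int(np.argmax(scores_demo))]
```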
n_neighbors = 7
knn = KNeighborsClassifier(n_neighbors=n_neighbors)
knn.fit(x_train, y_train)
print('Accuracy of the K-NN classifier on the training set: {:.2f}'
      .format(knn.score(x_train, y_train)))
print('Accuracy of the K-NN classifier on the test set: {:.2f}'
      .format(knn.score(x_eval, y_eval)))
Accuracy of the K-NN classifier on the training set: 0.85
Accuracy of the K-NN classifier on the test set: 0.83
pred = knn.predict(x_eval)
print('\n')
print(confusion_matrix(y_eval, pred))
print('\n')
print(classification_report(y_eval, pred))
[[ 172 70 0 0 0]
[ 86 1237 5 5 1]
[ 0 57 17 10 2]
[ 0 33 15 49 10]
[ 0 15 3 8 50]]
precision recall f1-score support
1 0.67 0.71 0.69 242
2 0.88 0.93 0.90 1334
3 0.42 0.20 0.27 86
4 0.68 0.46 0.55 107
5 0.79 0.66 0.72 76
accuracy 0.83 1845
macro avg 0.69 0.59 0.63 1845
weighted avg 0.81 0.83 0.82 1845
confusion_matrixV = confusion_matrix(y_eval, pred)
confusion_matrixV = pd.DataFrame(confusion_matrixV)
verdaderos_positivos = np.diag(confusion_matrixV)
falsos_positivos = confusion_matrixV.sum(axis=0) - verdaderos_positivos
falsos_negativos = confusion_matrixV.sum(axis=1) - verdaderos_positivos
verdaderos_negativos = (confusion_matrixV.to_numpy().sum() - (verdaderos_positivos + falsos_positivos + falsos_negativos))
confusion_matrixV.columns = ['Nivel 1','Nivel 2','Nivel 3','Nivel 4','Nivel 5']
for dato, vp, fp, vn, fn in zip(
        confusion_matrixV.columns,
        verdaderos_positivos, falsos_positivos,
        verdaderos_negativos, falsos_negativos):
    frame = pd.DataFrame.from_dict(
        {"Positive(1)": (vp, fp), "Negative(0)": (fn, vn)},
        orient="Index",
        columns=("Positive(1)", "Negative(0)"))
    frame.index.name = f'Predicted Values for "{dato}"'
    frame.columns.name = f'Actual values for "{dato}"'
    print(frame, "\n")
Actual values for "Nivel 1"     Positive(1)  Negative(0)
Predicted Values for "Nivel 1"
Positive(1)                             172           86
Negative(0)                              70         1517

Actual values for "Nivel 2"     Positive(1)  Negative(0)
Predicted Values for "Nivel 2"
Positive(1)                            1237          175
Negative(0)                              97          336

Actual values for "Nivel 3"     Positive(1)  Negative(0)
Predicted Values for "Nivel 3"
Positive(1)                              17           23
Negative(0)                              69         1736

Actual values for "Nivel 4"     Positive(1)  Negative(0)
Predicted Values for "Nivel 4"
Positive(1)                              49           23
Negative(0)                              58         1715

Actual values for "Nivel 5"     Positive(1)  Negative(0)
Predicted Values for "Nivel 5"
Positive(1)                              50           13
Negative(0)                              26         1756
X = X_99[[ 'Marca','Modelo','Combustible','Gases ralenti co', 'Gases ralenti hc', 'Gases ralenti co2', 'Gases ralenti o2', 'Gases crucero co', 'Gases crucero hc',
'Gases crucero co2', 'Gases crucero o2']]
y = X_99[['Clasificacion nivel de contaminacion']]
print(X.shape, y.shape)
(9225, 11) (9225, 1)
from sklearn.model_selection import train_test_split
# split the data into train and eval sets
x_train, x_eval, y_train, y_eval = train_test_split(X, y, test_size=0.20, train_size=0.80, random_state=0)
from sklearn.tree import DecisionTreeClassifier
# build the model without depth control; it will grow until every leaf is pure
arbol = DecisionTreeClassifier(criterion='gini')
# fit the model
arbol.fit(x_train, y_train)
DecisionTreeClassifier()
from sklearn.tree import export_graphviz
from graphviz import Source
import os
# estructura del árbol creado en el paso anterior:
print(f"Profundidad del árbol: {arbol.get_depth()}")
print(f"Número de nodos terminales: {arbol.get_n_leaves()}")
Profundidad del árbol: 17
Número de nodos terminales: 345
# precisión del modelo en datos de entrenamiento.
print("precisión entrenamiento: {0: .2f}".format(
    arbol.score(x_train, y_train)))
precisión entrenamiento:  1.00
# precisión del modelo en datos de evaluación.
print("precisión evaluación: {0: .2f}".format(arbol.score(x_eval, y_eval)))
precisión evaluación: 0.93
# profundidad del arbol de decisión
arbol.tree_.max_depth
17
We need to reduce the model's complexity to improve its ability to generalize. We must also keep in mind that if we reduce complexity too much, we may end up with a model that is too simple: instead of overfitting, it will perform well below its potential; we would say the model is underfitted and has high bias. To find the middle ground between model complexity and fit to the data, graphical tools are helpful. For example, we can build several models of varying complexity and then plot accuracy as a function of complexity.
import matplotlib.pyplot as plt
# Gráfico de ajuste del árbol de decisión
#fig, ax = plt.subplots(figsize=(10, 5))
fig, ax = plt.subplots(figsize=(6, 3.84));
train_prec = []
eval_prec = []
max_deep_list = list(range(1,30))
for deep in max_deep_list:
arbol3 = DecisionTreeClassifier(criterion='gini', max_depth=deep)
arbol3.fit(x_train, y_train)
train_prec.append(arbol3.score(x_train, y_train))
eval_prec.append(arbol3.score(x_eval, y_eval))
# graficar los resultados.
plt.plot(max_deep_list, train_prec, color='r', label='entrenamiento')
plt.plot(max_deep_list, eval_prec, color='b', label='evaluación')
plt.title('Gráfico de ajuste árbol de decisión')
plt.legend()
plt.ylabel('precisión')
plt.xlabel('profundidad del árbol / complejidad')
plt.show()
The plot we just built shows the model's accuracy as a function of its complexity. We can see that the highest accuracy on the evaluation data is obtained at a depth of less than 10.
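The manual depth sweep above can also be automated with cross-validation, so that the chosen depth does not depend on a single train/eval split. A minimal sketch on synthetic data (not the notebook's dataset; the grid range is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x_train / y_train
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# Cross-validated search over tree depth instead of a manual loop
grid = GridSearchCV(
    DecisionTreeClassifier(criterion='gini', random_state=0),
    param_grid={'max_depth': range(1, 15)},
    cv=5,
    scoring='accuracy',
)
grid.fit(X_demo, y_demo)
print(grid.best_params_['max_depth'], round(grid.best_score_, 3))
```

`best_params_` then gives a depth chosen on 5-fold cross-validated accuracy rather than on one held-out set.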
# modelo dos, con control de profundidad
arbol2 = DecisionTreeClassifier(criterion='gini', max_depth=4)
# Ajustando el modelo
arbol2.fit(x_train, y_train)
DecisionTreeClassifier(max_depth=4)
from sklearn.tree import plot_tree
import pydotplus
# Estructura del árbol creado
# ------------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(30, 5))
print(f"Profundidad del árbol: {arbol2.get_depth()}")
print(f"Número de nodos terminales: {arbol2.get_n_leaves()}")
plot = plot_tree(
decision_tree = arbol2,
feature_names = X.columns,
class_names = ['Nivel 1','Nivel 2','Nivel 3','Nivel 4','Nivel 5'],
filled = True,
impurity = False,
fontsize = 8,
ax = ax
)
Profundidad del árbol: 4
Número de nodos terminales: 16
# guardar png del árbol2:
import pydotplus
print(f"Profundidad del árbol: {arbol2.get_depth()}")
print(f"Número de nodos terminales: {arbol2.get_n_leaves()}")
labels = list(X.columns)
data = export_graphviz(decision_tree = arbol2,
feature_names = X.columns,
class_names = ['Nivel 1','Nivel 2','Nivel 3','Nivel 4','Nivel 5'],
rounded = True,
filled = True)
graph = pydotplus.graph_from_dot_data(data)
graph.write_png('Árbol_decisión_CDA.png')
Profundidad del árbol: 4
Número de nodos terminales: 16
True
from sklearn.tree import export_text
texto_modelo = export_text(decision_tree = arbol2, feature_names = list(X.columns))
print(texto_modelo)
|--- Gases ralenti hc <= 349.50
|   |--- Gases ralenti hc <= 17.50
|   |   |--- Gases ralenti o2 <= 0.44
|   |   |   |--- Gases crucero hc <= 18.50
|   |   |   |   |--- class: 1
|   |   |   |--- Gases crucero hc > 18.50
|   |   |   |   |--- class: 2
|   |   |--- Gases ralenti o2 > 0.44
|   |   |   |--- Gases ralenti o2 <= 0.81
|   |   |   |   |--- class: 2
|   |   |   |--- Gases ralenti o2 > 0.81
|   |   |   |   |--- class: 2
|   |--- Gases ralenti hc > 17.50
|   |   |--- Gases crucero hc <= 221.50
|   |   |   |--- Gases crucero co <= 2.30
|   |   |   |   |--- class: 2
|   |   |   |--- Gases crucero co > 2.30
|   |   |   |   |--- class: 2
|   |   |--- Gases crucero hc > 221.50
|   |   |   |--- Gases crucero co2 <= 11.65
|   |   |   |   |--- class: 4
|   |   |   |--- Gases crucero co2 > 11.65
|   |   |   |   |--- class: 3
|--- Gases ralenti hc > 349.50
|   |--- Gases crucero co2 <= 11.75
|   |   |--- Gases ralenti hc <= 533.00
|   |   |   |--- Gases ralenti co <= 5.76
|   |   |   |   |--- class: 4
|   |   |   |--- Gases ralenti co > 5.76
|   |   |   |   |--- class: 5
|   |   |--- Gases ralenti hc > 533.00
|   |   |   |--- Gases crucero hc <= 192.50
|   |   |   |   |--- class: 5
|   |   |   |--- Gases crucero hc > 192.50
|   |   |   |   |--- class: 5
|   |--- Gases crucero co2 > 11.75
|   |   |--- Gases ralenti hc <= 469.00
|   |   |   |--- Gases crucero hc <= 286.00
|   |   |   |   |--- class: 3
|   |   |   |--- Gases crucero hc > 286.00
|   |   |   |   |--- class: 4
|   |   |--- Gases ralenti hc > 469.00
|   |   |   |--- Gases ralenti hc <= 1078.00
|   |   |   |   |--- class: 4
|   |   |   |--- Gases ralenti hc > 1078.00
|   |   |   |   |--- class: 5
# precisión del modelo en datos de entrenamiento.
print("precisión entrenamiento: {0: .2f}".format(
arbol2.score(x_train, y_train)))
precisión entrenamiento: 0.89
# precisión del modelo en datos de evaluación.
print("precisión evaluación: {0: .2f}".format(
arbol2.score(x_eval, y_eval)))
precisión evaluación: 0.88
y_pred = arbol2.predict(x_eval)
print(confusion_matrix(y_eval, y_pred))
[[ 172   70    0    0    0]
 [  15 1288   30    1    0]
 [   0   28   43   15    0]
 [   0   16   14   70    7]
 [   0    0    1   17   58]]
matriz_confusion = confusion_matrix(y_eval, y_pred)
matriz_confusion = pd.DataFrame(matriz_confusion)
verdaderos_positivos = np.diag(matriz_confusion)
falsos_positivos = matriz_confusion.sum(axis=0) - verdaderos_positivos
falsos_negativos = matriz_confusion.sum(axis=1) - verdaderos_positivos
verdaderos_negativos = (matriz_confusion.to_numpy().sum() - (verdaderos_positivos + falsos_positivos + falsos_negativos))
matriz_confusion.columns = ['Nivel 1','Nivel 2','Nivel 3','Nivel 4','Nivel 5']
for dato, vp, fp, vn, fn in zip(
matriz_confusion.columns,
verdaderos_positivos, falsos_positivos,
verdaderos_negativos, falsos_negativos):
frame = pd.DataFrame.from_dict(
{"Positive(1)": (vp, fp), "Negative(0)": (fn, vn)},
orient="Index",
columns=("Positive(1)", "Negative(0)"))
frame.index.name = f'Predicted Values for "{dato}"'
frame.columns.name = f'Actual values for "{dato}"'
print(frame, "\n")
Actual values for "Nivel 1"     Positive(1)  Negative(0)
Predicted Values for "Nivel 1"
Positive(1)                             172           15
Negative(0)                              70         1588

Actual values for "Nivel 2"     Positive(1)  Negative(0)
Predicted Values for "Nivel 2"
Positive(1)                            1288          114
Negative(0)                              46          397

Actual values for "Nivel 3"     Positive(1)  Negative(0)
Predicted Values for "Nivel 3"
Positive(1)                              43           45
Negative(0)                              43         1714

Actual values for "Nivel 4"     Positive(1)  Negative(0)
Predicted Values for "Nivel 4"
Positive(1)                              70           33
Negative(0)                              37         1705

Actual values for "Nivel 5"     Positive(1)  Negative(0)
Predicted Values for "Nivel 5"
Positive(1)                              58            7
Negative(0)                              18         1762
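The per-class counts above map directly onto precision and recall, which `classification_report` (used next) computes automatically. A minimal sketch on a small hypothetical 3-class matrix:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cm = np.array([[50,  2,  3],
               [ 4, 40,  6],
               [ 1,  5, 30]])

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp  # predicted as class k, actually another class
fn = cm.sum(axis=1) - tp  # actually class k, predicted as another class

precision = tp / (tp + fp)  # per class: 50/55, 40/47, 30/39
recall = tp / (tp + fn)     # per class: 50/55, 40/50, 30/36
print(np.round(precision, 3), np.round(recall, 3))
```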
from sklearn.metrics import classification_report
print(classification_report(y_eval, y_pred));
precision recall f1-score support
1 0.92 0.71 0.80 242
2 0.92 0.97 0.94 1334
3 0.49 0.50 0.49 86
4 0.68 0.65 0.67 107
5 0.89 0.76 0.82 76
accuracy 0.88 1845
macro avg 0.78 0.72 0.75 1845
weighted avg 0.88 0.88 0.88 1845
Another analytical tool that helps us understand how overfitting is reduced with more data is the learning curve, which plots accuracy as a function of training-set size.
# Curvas de aprendizaje
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(estimator=arbol, X=x_train, y=y_train,
train_sizes=np.linspace(0.1, 1.0, 10),
n_jobs=-1)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
# graficando las curvas
#fig, ax = plt.subplots(figsize=(10, 5))
fig, ax = plt.subplots(figsize=(6, 3.84));
plt.plot(train_sizes, train_mean, color='r', marker='o', markersize=5,
label='entrenamiento')
plt.fill_between(train_sizes, train_mean + train_std,
train_mean - train_std, alpha=0.15, color='r')
plt.plot(train_sizes, test_mean, color='b', linestyle='--',
marker='s', markersize=5, label='evaluacion')
plt.fill_between(train_sizes, test_mean + test_std,
test_mean - test_std, alpha=0.15, color='b')
plt.grid()
plt.title('Curva de aprendizaje')
plt.legend(loc='best')
plt.xlabel('Tamaño')
plt.ylabel('Precisión')
plt.show()
This plot shows clearly that with little data, training and evaluation accuracies differ widely; as the amount of data increases, the model generalizes much better and the two accuracies begin to converge.
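The narrowing gap between the two curves can also be checked numerically. A sketch on synthetic data (the sizes and depth are assumptions, not the notebook's dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     n_informative=6, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    estimator=DecisionTreeClassifier(max_depth=4, random_state=0),
    X=X_demo, y=y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, n_jobs=-1)

# Gap between mean training and mean validation accuracy at each size;
# it tends to shrink as the training set grows
gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
print(np.round(gap, 3))
```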
# Gráfico de importancia de los predictores en los modelos: arbol y arbol2
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
plt.style.use('ggplot')
# visualizar importancia con el modelo 'arbol':
fig, ax = plt.subplots(figsize=(20, 6))
plt.subplot(2, 1, 1)
x = X.columns
y1 = arbol.feature_importances_ #importancia de cada variable predictora
sns.barplot(x=x, y=y1)
plt.title('Importancia de cada Predictor con el modelo: "arbol"')
# visualizar importancia con el modelo 'arbol2':
fig, ax = plt.subplots(figsize=(20, 6))
plt.subplot(2, 1, 2)
y2 = arbol2.feature_importances_ #importancia de cada variable predictora
sns.barplot(x=x, y=y2)
plt.title('Importancia de cada Predictor con el modelo: "arbol2"')
plt.show()
# predictores del modelo "arbol" vs arbol2" en orden de importancia:
importancia = pd.DataFrame({
    'Predictor': list(X.columns),
    'Importancia_arbol': np.round(arbol.feature_importances_, 3),
    'Importancia_arbol2': np.round(arbol2.feature_importances_, 3)})
importancia = importancia.sort_values('Importancia_arbol2', ascending=False)
importancia
|   | Predictor | Importancia_arbol | Importancia_arbol2 |
|---|---|---|---|
| 4 | Gases ralenti hc | 0.427 | 0.547 |
| 6 | Gases ralenti o2 | 0.168 | 0.207 |
| 8 | Gases crucero hc | 0.151 | 0.129 |
| 9 | Gases crucero co2 | 0.065 | 0.063 |
| 7 | Gases crucero co | 0.070 | 0.040 |
| 3 | Gases ralenti co | 0.054 | 0.014 |
| 0 | Marca | 0.003 | 0.000 |
| 1 | Modelo | 0.010 | 0.000 |
| 2 | Combustible | 0.001 | 0.000 |
| 5 | Gases ralenti co2 | 0.030 | 0.000 |
| 10 | Gases crucero o2 | 0.021 | 0.000 |
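Impurity-based importances like those tabulated above can be biased toward features with many possible split points; permutation importance on held-out data is a common cross-check. A sketch on synthetic data (not the notebook's predictors):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=800, n_features=6,
                                     n_informative=3, random_state=0)
x_tr, x_ev, y_tr, y_ev = train_test_split(X_demo, y_demo, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(x_tr, y_tr)

# Mean drop in accuracy when each feature is shuffled on held-out data
result = permutation_importance(tree, x_ev, y_ev, n_repeats=10,
                                random_state=0)
print(np.round(result.importances_mean, 3))
```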
from mlxtend.plotting import plot_decision_regions
# regiones de decisión en 2D con las dos variables más representativas del modelo final ("arbol2"):
X = pd.concat([X_99['Gases ralenti hc'], X_99['Gases ralenti o2']], axis=1)
y = X_99['Clasificacion nivel de contaminacion']
X2 = X.values  # columna 0 = Gases ralenti hc ; columna 1 = Gases ralenti o2
y2 = y.values  # variable respuesta = Clasificacion nivel de contaminacion
arbol2.fit(X2, y2)  # reajustamos arbol2 usando solo estas dos variables
fig, ax = plt.subplots(figsize=(10, 7))
plot_decision_regions(X2, y2, clf=arbol2, legend=2)
plt.xlabel('Gases ralenti hc')
plt.ylabel('Gases ralenti o2')
plt.title('Clasificación de gases hc y o2 con respecto al nivel de contaminación')
plt.show()
X = X_99[[ 'Marca','Modelo','Combustible','Gases ralenti co', 'Gases ralenti hc', 'Gases ralenti co2', 'Gases ralenti o2', 'Gases crucero co', 'Gases crucero hc',
'Gases crucero co2', 'Gases crucero o2']]
y = X_99[['Clasificacion nivel de contaminacion']]
print(X.shape, y.shape)
(9225, 11) (9225, 1)
from sklearn.model_selection import train_test_split
# separar los datos en train y eval
x_train, x_eval, y_train, y_eval = train_test_split(X, y, test_size=0.20, train_size=0.80, random_state=0)
from sklearn.svm import SVC
algoritmo = SVC(kernel='linear')
algoritmo.fit(x_train, y_train)
SVC(kernel='linear')
y_pred=algoritmo.predict(x_eval)
matriz_SVC=confusion_matrix(y_eval,y_pred)
print('Matriz de confusión: ')
print(matriz_SVC)
Matriz de confusión:
[[ 232   10    0    0    0]
 [  12 1322    0    0    0]
 [   0    1   84    1    0]
 [   0    0    0  105    2]
 [   0    0    0    0   76]]
print('Precisión del clasificador en el conjunto de entrenamiento: {:.2f}'
.format(algoritmo.score(x_train, y_train)))
print('Precisión del clasificador en el conjunto de prueba: {:.2f}'
.format(algoritmo.score(x_eval, y_eval)))
Precisión del clasificador en el conjunto de entrenamiento: 0.98
Precisión del clasificador en el conjunto de prueba: 0.99
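SVC is sensitive to feature scale, and the gas readings span very different ranges (hc in the hundreds, o2 near zero), so it is worth comparing against a standardized pipeline. A sketch on synthetic data, not the model fitted above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     n_informative=5, random_state=0)
x_tr, x_ev, y_tr, y_ev = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Scale inside the pipeline so the scaler is fit only on training data
modelo_svc = make_pipeline(StandardScaler(), SVC(kernel='linear'))
modelo_svc.fit(x_tr, y_tr)
print(round(modelo_svc.score(x_ev, y_ev), 2))
```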
X = X_99[[ 'Marca','Modelo','Combustible','Gases ralenti co', 'Gases ralenti hc', 'Gases ralenti co2', 'Gases ralenti o2', 'Gases crucero co', 'Gases crucero hc',
'Gases crucero co2', 'Gases crucero o2']]
y = X_99[['Clasificacion nivel de contaminacion']]
from sklearn.model_selection import train_test_split
# separar los datos en train y eval
x_train, x_eval, y_train, y_eval = train_test_split(X, y, test_size=0.20, train_size=0.80, random_state=0)
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver='adam', alpha=0.0001,
                    hidden_layer_sizes=(10, 10, 10, 10, 10),
                    random_state=0, max_iter=500, tol=1e-10)
clf.fit(x_train, y_train)
predictions=clf.predict(x_eval)
from sklearn.metrics import classification_report
print(classification_report(y_eval,predictions));
precision recall f1-score support
1 0.84 0.96 0.90 242
2 0.99 0.97 0.98 1334
3 0.99 0.88 0.93 86
4 0.94 0.94 0.94 107
5 0.94 1.00 0.97 76
accuracy 0.96 1845
macro avg 0.94 0.95 0.94 1845
weighted avg 0.97 0.96 0.96 1845
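With a very small `tol` and `max_iter=500`, as used above, training runs long; scikit-learn's `early_stopping` option instead holds out a validation fraction and stops when its score plateaus. A sketch on synthetic data (the layer sizes here are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     n_informative=5, random_state=0)

# early_stopping reserves a validation fraction and halts training
# once the validation score stops improving for n_iter_no_change epochs
mlp = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=500,
                    early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=10, random_state=0)
mlp.fit(X_demo, y_demo)
print(mlp.n_iter_, round(mlp.best_validation_score_, 2))
```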
X = X_99[[ 'Marca','Modelo','Combustible','Gases ralenti co', 'Gases ralenti hc', 'Gases ralenti co2', 'Gases ralenti o2', 'Gases crucero co', 'Gases crucero hc',
'Gases crucero co2', 'Gases crucero o2']]
y = X_99[['Clasificacion nivel de contaminacion']]
from sklearn.model_selection import train_test_split
# separar los datos en train y eval
x_train, x_eval, y_train, y_eval = train_test_split(X, y, test_size=0.20, train_size=0.80, random_state=0)
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from scipy import stats
# A la matriz de predictores se le tiene que añadir una columna de 1s para el intercept del modelo
x_train_reg = sm.add_constant(x_train, prepend=True)
modelo = sm.OLS(endog=y_train, exog=x_train_reg,)
modelo = modelo.fit()
print(modelo.summary())
OLS Regression Results
================================================================================================
Dep. Variable: Clasificacion nivel de contaminacion R-squared: 0.697
Model: OLS Adj. R-squared: 0.696
Method: Least Squares F-statistic: 1537.
Date: Mon, 07 Jun 2021 Prob (F-statistic): 0.00
Time: 19:54:22 Log-Likelihood: -5264.6
No. Observations: 7380 AIC: 1.055e+04
Df Residuals: 7368 BIC: 1.064e+04
Df Model: 11
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
const 10.4454 1.459 7.160 0.000 7.586 13.305
Marca -0.0113 0.003 -3.554 0.000 -0.017 -0.005
Modelo -0.0036 0.001 -5.008 0.000 -0.005 -0.002
Combustible 0.1124 0.024 4.726 0.000 0.066 0.159
Gases ralenti co 0.1364 0.015 9.340 0.000 0.108 0.165
Gases ralenti hc 0.0004 2.07e-05 17.245 0.000 0.000 0.000
Gases ralenti co2 -0.0577 0.022 -2.604 0.009 -0.101 -0.014
Gases ralenti o2 0.0474 0.016 3.020 0.003 0.017 0.078
Gases crucero co 0.1347 0.014 9.808 0.000 0.108 0.162
Gases crucero hc 0.0003 2.32e-05 11.460 0.000 0.000 0.000
Gases crucero co2 -0.0634 0.021 -3.020 0.003 -0.105 -0.022
Gases crucero o2 0.0233 0.016 1.483 0.138 -0.007 0.054
==============================================================================
Omnibus: 785.199 Durbin-Watson: 1.996
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6818.606
Skew: 0.078 Prob(JB): 0.00
Kurtosis: 7.706 Cond. No. 5.11e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.11e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
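The large condition number flagged in the notes usually reflects multicollinearity among the predictors. Variance inflation factors quantify it per column, with VIF_j = 1/(1 − R²_j) from regressing predictor j on the others. A numpy-only sketch on hypothetical data (not the notebook's design matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predictors: x2 nearly collinear with x0
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = 2 * x0 + rng.normal(scale=0.1, size=200)
X_demo = np.column_stack([x0, x1, x2])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# x0 and x2 should show large VIFs, x1 a VIF near 1
print([round(vif(X_demo, j), 1) for j in range(3)])
```

A common rule of thumb treats VIF above 5–10 as a sign of problematic collinearity.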
# Diagnóstico errores (residuos) de las predicciones de entrenamiento
y_train = y_train.values
y_train=y_train.flatten()
prediccion_train = modelo.predict(exog = x_train_reg)
residuos_train = prediccion_train - y_train
import seaborn as sns
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(9, 8))
axes[0, 0].scatter(y_train, prediccion_train, edgecolors=(0, 0, 0), alpha = 0.4)
axes[0, 0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()],
                'k--', lw=2)
axes[0, 0].set_title('Valor predicho vs valor real', fontsize = 10, fontweight = "bold")
axes[0, 0].set_xlabel('Real')
axes[0, 0].set_ylabel('Predicción')
axes[0, 0].tick_params(labelsize = 7)
axes[0, 1].scatter(list(range(len(y_train))), residuos_train,
edgecolors=(0, 0, 0), alpha = 0.4)
axes[0, 1].axhline(y = 0, linestyle = '--', color = 'black', lw=2)
axes[0, 1].set_title('Residuos del modelo', fontsize = 10, fontweight = "bold")
axes[0, 1].set_xlabel('id')
axes[0, 1].set_ylabel('Residuo')
axes[0, 1].tick_params(labelsize = 7)
sns.histplot(
data = residuos_train,
stat = "density",
kde = True,
line_kws= {'linewidth': 1},
color = "firebrick",
alpha = 0.3,
ax = axes[1, 0]
)
axes[1, 0].set_title('Distribución residuos del modelo', fontsize = 10,
fontweight = "bold")
axes[1, 0].set_xlabel("Residuo")
axes[1, 0].tick_params(labelsize = 7)
sm.qqplot(
residuos_train,
fit = True,
line = 'q',
ax = axes[1, 1],
color = 'firebrick',
alpha = 0.4,
lw = 2
)
axes[1, 1].set_title('Q-Q residuos del modelo', fontsize = 10, fontweight = "bold")
axes[1, 1].tick_params(labelsize = 7)
axes[2, 0].scatter(prediccion_train, residuos_train,
edgecolors=(0, 0, 0), alpha = 0.4)
axes[2, 0].axhline(y = 0, linestyle = '--', color = 'black', lw=2)
axes[2, 0].set_title('Residuos del modelo vs predicción', fontsize = 10, fontweight = "bold")
axes[2, 0].set_xlabel('Predicción')
axes[2, 0].set_ylabel('Residuo')
axes[2, 0].tick_params(labelsize = 7)
# Se eliminan los axes vacíos
fig.delaxes(axes[2,1])
fig.tight_layout()
plt.subplots_adjust(top=0.9)
fig.suptitle('Diagnóstico residuos', fontsize = 12, fontweight = "bold");
# Normalidad de los residuos Shapiro-Wilk test
# ==============================================================================
shapiro_test = stats.shapiro(residuos_train);
shapiro_test
ShapiroResult(statistic=0.9231351613998413, pvalue=0.0)
# Normalidad de los residuos D'Agostino's K-squared test
# ==============================================================================
k2, p_value = stats.normaltest(residuos_train)
print(f"Estadístico = {k2}, p-value = {p_value}")
Estadístico = 785.1989262249446, p-value = 3.134870262040877e-171
# Error de test del modelo
from sklearn.metrics import mean_squared_error
# ==============================================================================
x_eval = sm.add_constant(x_eval, prepend=True)
predicciones = modelo.predict(exog = x_eval)
rmse = mean_squared_error(
y_true = y_eval,
y_pred = predicciones,
squared = False
)
print("")
print(f"El error (rmse) de test es: {rmse}")
El error (rmse) de test es: 0.5175718718686321
prueba = pd.read_excel('prueba_RN.xlsx')
prueba
# KNN:
P_KNN = knn.predict(prueba)
P_KNN = pd.DataFrame(P_KNN)
P_KNN.to_excel('P_KNN.xlsx')
# ÁRBOL DECISIÓN:
P_DT = arbol2.predict(prueba)
P_DT = pd.DataFrame(P_DT)
P_DT.to_excel('P_DT.xlsx')
# MÁQUINA VECTOR SOPORTE:
P_SVM = algoritmo.predict(prueba)
P_SVM = pd.DataFrame(P_SVM)
P_SVM.to_excel('P_SVM.xlsx')
# RED NEURONAL:
P_NN = clf.predict(prueba)
P_NN = pd.DataFrame(P_NN)
P_NN.to_excel('P_NN.xlsx')
Xf = X_99[[ 'Gases ralenti hc','Gases ralenti o2', 'Gases crucero hc']]
yf = X_99[['Clasificacion nivel de contaminacion']]
# separar los datos en train y eval
x_train, x_eval, y_train, y_eval = train_test_split(Xf, yf, test_size=0.20, train_size=0.80, random_state=0)
arbol_f = DecisionTreeClassifier(criterion='gini', max_depth=4)
# Ajustando el modelo
arbol_f.fit(x_train, y_train)
print('Evaluación del modelo de Árboles de decisión con las 3 variables más importantes: ')
print('---------------------------------------------------------------------------------')
# precisión del modelo en datos de entrenamiento.
print("Precisión entrenamiento: {0: .2f}".format(
arbol_f.score(x_train, y_train)))
# precisión del modelo en datos de evaluación.
print("Precisión evaluación: {0: .2f}".format(
arbol_f.score(x_eval, y_eval)))
print('\n')
# Gráfico de importancia de los 3 predictores más importantes:
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
plt.style.use('ggplot')
# visualizar importancia con el modelo 'arbol_f':
fig, ax = plt.subplots(figsize=(6, 3.84));
x = Xf.columns
y1 = arbol_f.feature_importances_ #importancia de cada variable predictora
sns.barplot(x=x, y=y1)
plt.title('Importancia de cada predictor con el modelo: "arbol_f"')
plt.show()
Evaluación del modelo de Árboles de decisión con las 3 variables más importantes:
---------------------------------------------------------------------------------
Precisión entrenamiento:  0.88
Precisión evaluación:  0.87
from mlxtend.plotting import plot_decision_regions
# regiones de decisión en 2D con las dos variables más representativas del modelo final ("arbol_f"):
X = pd.concat([X_99['Gases ralenti hc'], X_99['Gases ralenti o2']], axis=1)
y = X_99['Clasificacion nivel de contaminacion']
X2 = X.values  # columna 0 = Gases ralenti hc ; columna 1 = Gases ralenti o2
y2 = y.values  # variable respuesta = Clasificacion nivel de contaminacion
arbol_f.fit(X2, y2)  # reajustamos arbol_f usando solo estas dos variables
fig, ax = plt.subplots(figsize=(10, 7))
plot_decision_regions(X2, y2, clf=arbol_f, legend=2)
plt.xlabel('Gases ralenti hc')
plt.ylabel('Gases ralenti o2')
plt.title('Clasificación de gases ralenti hc y o2 con respecto al nivel de contaminación')
plt.show()